Reading Mass Spec data files

You can read several files at the same time using the merger module. For now, we will read only a single file.

Here we read a CSV file and get descriptive information about the mass spec data using the module msdas.readers.MassSpecReader.



In [1]:

    
from msdas import *
from msdas import yeast
%pylab inline









    



Couldn't import dot_parser, loading of dot files will not be possible.
Populating the interactive namespace from numpy and matplotlib

Here are some files to play with. These are 6 files that should be merged. However, we can read them one by one for demonstration.

Reading the data (36 columns of experiments + 7 of metadata)



In [2]:

    
y = MassSpecReader(yeast.get_yeast_small_data())









    



INFO:root:Reading /home/cokelaer/Work/github/msdas/share/data/YEAST_small_all.csv
INFO:root:Renaming psites with ^ character
INFO:root:Replacing zeros with NAs
INFO:root:-- Removing 0 rows with ambigous protein names:
INFO:root:--------------------------------------------------
WARNING:root:Rebuilding identifier in the dataframe. MERGED prefixes will be lost

Calling the print function gives some basic information about the number of rows, proteins and peptides.



In [3]:

    
print(y)









    



This dataframe contains 40 columns in addition to the standard columns Protein,
Sequence, Psite
Your data contains 23 unique proteins
Your data contains 57 combination of psites/proteins
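
As a rough illustration of what these counts mean, the sketch below reproduces the same kind of figures with plain pandas on three DIG1 rows of this data set (psites S126+S127, S142 and S272). It is not the MassSpecReader implementation; how msdas computes the numbers internally is an assumption here.

```python
import pandas as pd

# Three rows of the yeast data set, metadata columns only
df = pd.DataFrame({
    "Protein": ["DIG1", "DIG1", "DIG1"],
    "Psite": ["S126+S127", "S142", "S272"],
})

# Number of distinct protein names
n_proteins = df["Protein"].nunique()

# Number of distinct (protein, psite) pairs
n_combinations = len(df[["Protein", "Psite"]].drop_duplicates())

print("%d unique proteins" % n_proteins)                       # 1 unique proteins
print("%d combinations of psites/proteins" % n_combinations)   # 3 combinations of psites/proteins
```

On the full small data set the same two counts are 23 and 57, as printed above.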

The data is contained in the data frame called df. Read-only aliases to the measurements only, or to the metadata only, are available in the data frames called measurements and metadata, respectively.



In [4]:

    
y.df.iloc[[0, 1, 2]]









    Out[4]:

   Protein  Sequence               Psite      Sequence_Phospho                a0_t0     a0_t1     a0_t5     a0_t10    a0_t20    a0_t45    ...  a20_t45   a45_t0    a45_t1    a45_t5    a45_t10   a45_t20   a45_t45   Entry   Entry_name  Identifier
0  DIG1     DGNLASSNSAHFPPVANQNVK  S126+S127  DGNLAS(Phospho)SNSAHFPPVANQNVK  0.000415  0.000397  0.000671  0.000602  0.000440  0.000418  ...  0.001009  0.000289  0.000300  0.000270  0.000358  0.000367  0.000313  Q03063  DIG1_YEAST  DIG1_S126+S127
1  DIG1     SAPAQVTQHSK            S142       S(Phospho)APAQVTQHSK            0.000187  0.000185  0.000267  0.000202  0.000138  0.000226  ...  0.001241  0.001144  0.001364  0.001237  0.001091  0.001425  0.001707  Q03063  DIG1_YEAST  DIG1_S142
2  DIG1     VNDSYDSPLSGTASTGK      S272       VNDSYDS(Phospho)PLSGTASTGK      0.000338  0.000330  0.000538  0.000505  0.000381  0.000328  ...  0.000232  0.000178  0.000208  0.000122  0.000212  0.000203  0.000221  Q03063  DIG1_YEAST  DIG1_S272

3 rows × 43 columns



In [5]:

    
y.measurements.iloc[[0, 1, 2]]









    Out[5]:

   a0_t0     a0_t1     a0_t5     a0_t10    a0_t20    a0_t45    a1_t0     a1_t1     a1_t5     a1_t10    ...  a20_t5    a20_t10   a20_t20   a20_t45   a45_t0    a45_t1    a45_t5    a45_t10   a45_t20   a45_t45
0  0.000415  0.000397  0.000671  0.000602  0.000440  0.000418  0.001416  0.001090  0.000775  0.000668  ...  0.001149  0.000917  0.000902  0.001009  0.000289  0.000300  0.000270  0.000358  0.000367  0.000313
1  0.000187  0.000185  0.000267  0.000202  0.000138  0.000226  0.000177  0.000162  0.000136  0.000142  ...  0.001135  0.000899  0.001064  0.001241  0.001144  0.001364  0.001237  0.001091  0.001425  0.001707
2  0.000338  0.000330  0.000538  0.000505  0.000381  0.000328  0.000367  0.000379  0.000330  0.000372  ...  0.000349  0.000319  0.000314  0.000232  0.000178  0.000208  0.000122  0.000212  0.000203  0.000221

3 rows × 36 columns



In [6]:

    
y.metadata.iloc[[0, 1, 2]]









    Out[6]:

   Identifier      Protein  Sequence               Sequence_Phospho                Psite      Entry   Entry_name
0  DIG1_S126+S127  DIG1     DGNLASSNSAHFPPVANQNVK  DGNLAS(Phospho)SNSAHFPPVANQNVK  S126+S127  Q03063  DIG1_YEAST
1  DIG1_S142       DIG1     SAPAQVTQHSK            S(Phospho)APAQVTQHSK            S142       Q03063  DIG1_YEAST
2  DIG1_S272       DIG1     VNDSYDSPLSGTASTGK      VNDSYDS(Phospho)PLSGTASTGK      S272       Q03063  DIG1_YEAST
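
The relation between df, measurements and metadata can be pictured with a small pandas sketch. It only illustrates the idea of two column subsets of one data frame; the list of metadata column names is taken from the output above, and the way msdas actually builds its aliases is an assumption.

```python
import pandas as pd

# A cut-down stand-in for y.df: two rows, two measurement columns
df = pd.DataFrame({
    "Protein": ["DIG1", "DIG1"],
    "Psite": ["S142", "S272"],
    "a0_t0": [0.000187, 0.000338],
    "a0_t1": [0.000185, 0.000330],
    "Identifier": ["DIG1_S142", "DIG1_S272"],
})

# Columns treated as metadata in this sketch (subset of the 7 shown above)
metadata_names = ["Identifier", "Protein", "Psite"]

# Metadata view: only the descriptive columns
metadata = df[metadata_names]

# Measurements view: every remaining column, in the original order
measurements = df[[c for c in df.columns if c not in metadata_names]]

print(list(metadata.columns))      # ['Identifier', 'Protein', 'Psite']
print(list(measurements.columns))  # ['a0_t0', 'a0_t1']
```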

Statistics about Phospho sites



In [7]:

    
y.plot_phospho_stats()

Histogram of peptide lengths



In [8]:

    
y.hist_peptide_sequence_length()

Visualising the variation in each experiment



In [9]:

    
y.boxplot() # variation in each experiment









    



/home/cokelaer/Work/virtualenv/lib64/python2.7/site-packages/pandas/tools/plotting.py:2633: FutureWarning: 
The default value for 'return_type' will change to 'axes' in a future release.
 To use the future behavior now, set return_type='axes'.
 To keep the previous behavior and silence this warning, set return_type='dict'.
  warnings.warn(msg, FutureWarning)

Visualise data as a "time series"



In [10]:

    
y.plot_timeseries('DIG1_S272')









    Out[10]:

   a0_t0     a0_t1     a0_t5     a0_t10    a0_t20    a0_t45    a1_t0     a1_t1     a1_t5     a1_t10    ...  a20_t5    a20_t10   a20_t20   a20_t45   a45_t0    a45_t1    a45_t5    a45_t10   a45_t20   a45_t45
2  0.000338  0.00033   0.000538  0.000505  0.000381  0.000328  0.000367  0.000379  0.00033   0.000372  ...  0.000349  0.000319  0.000314  0.000232  0.000178  0.000208  0.000122  0.000212  0.000203  0.000221

1 rows × 36 columns

Visualise data in a 6 by 6 image (YEAST case only)



In [11]:

    
y.plot_experiments("DIG1_S272")









    



WARNING:root:Works with yeast data set only






    Out[11]:

     0         1         2         3         4         5
a0   0.000338  0.000330  0.000538  0.000505  0.000381  0.000328
a1   0.000367  0.000379  0.000330  0.000372  0.000415  0.000270
a5   0.000505  0.000465  0.000502  0.000473  0.000418  0.000493
a10  0.000521  0.000550  0.000538  0.000629  0.000478  0.000460
a20  0.000243  0.000312  0.000349  0.000319  0.000314  0.000232
a45  0.000178  0.000208  0.000122  0.000212  0.000203  0.000221
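
The 6 by 6 layout follows from the measurement column names: each of the 36 names has the form a<X>_t<Y>, with six "a" values and six "t" values. The sketch below rebuilds the grid for DIG1_S272 from the names alone, using only the standard library. It is an illustration of the layout, not the code behind plot_experiments, and it assumes that the columns 0 to 5 of the table above correspond to t0, t1, t5, t10, t20 and t45 in that order.

```python
import re

levels = [0, 1, 5, 10, 20, 45]

# Column names in the order used by the measurements: a0_t0, a0_t1, ..., a45_t45
names = ["a%d_t%d" % (a, t) for a in levels for t in levels]

# Values for DIG1_S272, copied row by row from the 6 by 6 table above
values = [
    0.000338, 0.000330, 0.000538, 0.000505, 0.000381, 0.000328,
    0.000367, 0.000379, 0.000330, 0.000372, 0.000415, 0.000270,
    0.000505, 0.000465, 0.000502, 0.000473, 0.000418, 0.000493,
    0.000521, 0.000550, 0.000538, 0.000629, 0.000478, 0.000460,
    0.000243, 0.000312, 0.000349, 0.000319, 0.000314, 0.000232,
    0.000178, 0.000208, 0.000122, 0.000212, 0.000203, 0.000221,
]

# Split each name into its "a" and "t" parts and store the value in a nested dict
grid = {}
for name, value in zip(names, values):
    a, t = re.match(r"a(\d+)_t(\d+)$", name).groups()
    grid.setdefault("a" + a, {})["t" + t] = value

# Print one line per "a" value, with the six "t" values as columns
for a in levels:
    row = grid["a%d" % a]
    print("a%-3d %s" % (a, " ".join("%.6f" % row["t%d" % t] for t in levels)))
```

Parsing the names, rather than reshaping by position, keeps the sketch correct even if the columns were stored in a different order.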

Creating an instance from an existing instance



In [12]:

    
y2 = readers.MassSpecReader(y, verbose=False)



In [13]:

    
y2 == y









    Out[13]:





True

Reading data with an empty instance of MassSpecReader



In [14]:

    
y3 = readers.MassSpecReader(verbose=True)
filename = yeast.get_yeast_small_data()
y3.read_csv(filename)









    



INFO:root:Reading /home/cokelaer/Work/github/msdas/share/data/YEAST_small_all.csv



In [15]:

    
y3 == y









    Out[15]:





False

Here, the data read with the function read_csv differs from the data in y. When a filename is passed to the constructor, the cleanup function is called automatically; read_csv only reads the file, as the log above shows. So you have to call the cleanup function manually:



In [16]:

    
y3.cleanup()









    



INFO:root:Renaming psites with ^ character
INFO:root:Replacing zeros with NAs
INFO:root:-- Removing 0 rows with ambigous protein names:
INFO:root:--------------------------------------------------
WARNING:root:Rebuilding identifier in the dataframe. MERGED prefixes will be lost



In [17]:

    
y3 == y









    Out[17]:





True
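
Why a missing cleanup changes the result of the comparison can be seen with a plain pandas sketch of one logged step, "Replacing zeros with NAs". The zero in the toy data is hypothetical, and both the use of replace and the use of DataFrame.equals are stand-ins: how msdas implements cleanup and the == operator is an assumption here.

```python
import numpy as np
import pandas as pd

# Toy measurements; the 0.0 entry is hypothetical and stands for a missing value
raw = pd.DataFrame({
    "a0_t0": [0.000415, 0.0],
    "a0_t1": [0.000397, 0.000185],
})

# Stand-in for the "Replacing zeros with NAs" cleanup step
cleaned = raw.replace(0, np.nan)

# Before the step is applied, the two frames differ
print(raw.equals(cleaned))                     # False

# After applying the same step to the raw data, they match
print(raw.replace(0, np.nan).equals(cleaned))  # True
```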


